feat: Add report_progress to TrainContext #9826

Merged: gt2345 merged 13 commits into main from gt/499-report-progress on Aug 19, 2024

Conversation

@gt2345 (Contributor) commented Aug 15, 2024

Ticket

MD-499

Description

Add report_progress to TrainContext

Test Plan

Find an experiment that uses report_progress from TrainContext, such as examples/tutorials/core_api/1_metrics.py.
Start the experiment and, while it is running, monitor it to verify that the reported progress changes.
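
For orientation, a minimal sketch of the kind of Core API loop this test plan exercises is shown below; core_context comes from det.core.init(), while max_steps and the per-step work are made up for illustration (this is not the tutorial's actual code):

    # Illustrative sketch only: a training loop that reports progress each step.
    import determined as det

    max_steps = 100  # made-up step count for illustration

    with det.core.init() as core_context:
        for step in range(1, max_steps + 1):
            ...  # real training work for this step would go here
            # NEW in this PR: report completion as a fraction in [0, 1] so the
            # WebUI can render progress while the experiment runs.
            core_context.train.report_progress(step / max_steps)

While something like this runs, the progress shown for the experiment in the WebUI should advance from 0% toward 100%.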

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code


netlify bot commented Aug 15, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: bdfb80a
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/66be9a4014baa60008368bab

codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 15 lines in your changes missing coverage. Please review.

Project coverage is 54.37%. Comparing base (91d0b67) to head (bdfb80a).
Report is 21 commits behind head on main.

Files | Patch % | Lines
harness/determined/core/_train.py | 28.57% | 5 Missing ⚠️
master/internal/api_trials.go | 0.00% | 4 Missing ⚠️
master/internal/experiment.go | 0.00% | 4 Missing ⚠️
master/internal/db/postgres_experiments.go | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9826      +/-   ##
==========================================
- Coverage   54.38%   54.37%   -0.01%     
==========================================
  Files        1261     1261              
  Lines      155770   155787      +17     
  Branches     3540     3540              
==========================================
- Hits        84711    84709       -2     
- Misses      70921    70940      +19     
  Partials      138      138              
Flag | Coverage Δ
backend | 45.22% <23.07%> (-0.02%) ⬇️
harness | 72.60% <28.57%> (-0.02%) ⬇️
web | 53.70% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files | Coverage Δ
harness/determined/core/_searcher.py | 84.78% <ø> (ø)
master/internal/trials/postgres_trials.go | 29.67% <100.00%> (+0.86%) ⬆️
master/internal/db/postgres_experiments.go | 57.40% <0.00%> (-0.18%) ⬇️
master/internal/api_trials.go | 55.48% <0.00%> (-0.18%) ⬇️
master/internal/experiment.go | 30.12% <0.00%> (-0.09%) ⬇️
harness/determined/core/_train.py | 39.20% <28.57%> (-0.64%) ⬇️

... and 3 files with indirect coverage changes

@gt2345 changed the title from "Gt/499 report progress" to "feat: Add report_progress to TrainContext" on Aug 15, 2024

The ``progress`` should be the actual progress.
"""
logger.debug("report_progress()")
Contributor:

let's put the progress value in the log statement

Contributor:

also, we should probably validate that progress is within [0, 1] here too
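
For illustration, here is a minimal sketch of how the harness-side method could pick up both comments above (log the value, validate the range). This is an assumption-laden stub, not the merged implementation; the real method still has to send the report to the master.

    # Sketch only: an illustrative stub, not the code merged in this PR.
    import logging

    logger = logging.getLogger("determined.core")


    class TrainContext:  # stand-in for the real class in harness/determined/core/_train.py
        def report_progress(self, progress: float) -> None:
            # Validate the range, per the review comment above.
            if not 0.0 <= progress <= 1.0:
                raise ValueError(f"progress must be in [0, 1], got {progress}")
            # Include the value in the log statement, per the review comment above.
            logger.debug(f"report_progress({progress})")
            # ... the actual report to the master would follow here ...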

@@ -260,6 +260,19 @@ def report_early_exit(self, reason: EarlyExitReason) -> None:
        if r.status_code == 400:
            logger.warn("early exit has already been reported for this trial, ignoring new value")

    def report_progress(self, progress: float) -> None:
Contributor:

this method also needs to be in DummyTrainContext. otherwise local training will not work

can show accurate progress to users.

The ``progress`` should be the actual progress.
"""
Contributor:

nit: wording, and style. i know we're not consistent with style for docstrings in this class, but let's do it for new methods. we mostly try to follow the google style guide

suggestion:

        """
        Report training progress to the master.

        This is optional for training, but will be used by the WebUI to render completion status.

        Progress must be reported as a float between 0 and 1.0, where 1.0 is 100% completion. It 
        should represent the current iteration step as a fraction of maximum training steps 
        (i.e.: `report_progress(step_num / max_steps)`). 

        Note that for hyperparameter search, progress should be reported through
        ``SearcherOperation.report_progress()`` in the Searcher API instead.

        Arguments:
            progress (float): completion progress in the range [0, 1.0].
        """

@@ -1406,6 +1406,7 @@ func (a *apiServer) ReportTrialProgress(
	msg := experiment.TrialReportProgress{
		RequestID: rID,
		Progress:  searcher.PartialUnits(req.Progress),
Contributor:

if it's not too much effort, it'd be great if we could get rid of this searcher.PartialUnits type, it's just a float anyway. will need to do it sooner or later, but not strictly necessary as part of this PR.

Contributor Author:

Since it touches multiple files and isn't directly related to this PR, I created a ticket for it

	}
	if progress < 0 || progress > 1 {
		e.syslog.Errorf("Invalid progress value: %f", progress)
		return nil
Contributor:

this should return an error, otherwise we'll return a 200 in the HTTP POST

Contributor Author:

Sure, but I'm curious why the DB error on the next line is ignored?

Contributor:

it should also be returned, feel free to update it if you want. the only reason i can think of to not return it is if you have a one-off DB connection failure or something and you don't want to exit training because progress reporting isn't important.

this case is important tho IMO because if progress isn't [0,1] then that means user code is wrong and they should know to fix it.

@@ -482,6 +485,22 @@ def test_core_api_distributed_tutorial() -> None:
)


@pytest.mark.e2e_cpu
Contributor:

this feature is logically simple enough that i don't think it's worth the time/resource/maintenance cost to have an e2e test, tbh.

if you wanted you could probably add a unit test for it, but i also think that's overkill, since it's just hitting an existing API.

@@ -24,6 +24,8 @@ def main(core_context, increment_by):
        core_context.train.report_training_metrics(
            steps_completed=steps_completed, metrics={"x": x}
        )
        # NEW: report training progress.
        core_context.train.report_progress(batch/100.0)
Contributor:

  1. the denominator is confusing here, because it looks like we're calculating out of 100% or something. would be better if we defined max_length=100 as a variable outside this loop, and used it in the for batch in range(max_length) and also here.
  2. shouldn't be batch, it should be steps_completed
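
For illustration, a sketch of the tutorial loop with both points applied is below; it mirrors the main(core_context, increment_by) shape from the diff above, but the loop body is a simplified stand-in rather than the merged tutorial code:

    # Sketch only: the tutorial loop with the reviewer's two suggestions applied.
    import determined as det


    def main(core_context: det.core.Context, increment_by: int = 1) -> None:
        max_length = 100  # total training steps, defined once outside the loop
        x = 0
        for batch in range(max_length):
            x += increment_by  # simplified stand-in for the tutorial's training work
            steps_completed = batch + 1
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"x": x}
            )
            # NEW: report training progress as a fraction of total steps.
            core_context.train.report_progress(steps_completed / max_length)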

Contributor:

  1. include this change in 2_checkpoints.py, too. they're meant to be incremental tutorials, hence the "# NEW: ..."
  2. also update detached mode tutorials (and make sure this works in detached mode. it should, but just in case)

Contributor Author:

I don't think this function would work with detached mode out of the box, because the existing method retrieves the experiment from experiment.ExperimentRegistry, and we currently do not include unmanaged experiments in ExperimentRegistry. So we can either include unmanaged experiments in ExperimentRegistry, or retrieve unmanaged experiments from the DB instead.

@determined-ci added the documentation label (Improvements or additions to documentation) on Aug 15, 2024
@determined-ci requested a review from a team on August 15, 2024 20:15
@@ -312,6 +336,9 @@ def upload_tensorboard_files(
    def report_early_exit(self, reason: EarlyExitReason) -> None:
        logger.info(f"report_early_exit({reason})")

    def report_progress(self, progress: float) -> None:
        logger.info(f"report_progres with progress={progress}")
Contributor:

nit: typo

@azhou-determined (Contributor) left a comment:

a few other small things for technical correctness:

  1. the check

     if progress < 0 || progress > 1 {
         return errors.Errorf("Invalid progress value: %f", progress)
     }

     should be moved up to (a *apiServer) ReportTrialProgress, since now we're saving to the db for unmanaged exps upfront.

  2. (a *apiServer) PatchTrial should be updated to set progress to 1.0 when unmanaged trials exit. currently if your last progress report was .90, it'd never reach 100%. not sure what this means for the web ui, but not a big deal.

@azhou-determined (Contributor):

nice work! thanks for helping out with this ticket! 🙏

@maxrussell (Contributor) left a comment:

Backend LGTM

@gt2345 merged commit 13622ad into main on Aug 19, 2024
82 of 96 checks passed
@gt2345 deleted the gt/499-report-progress branch on August 19, 2024 19:28
Labels: cla-signed, documentation (Improvements or additions to documentation)

4 participants